ABSTRACT

An approach for clearer observation of differences
when evaluating educational programs is presented. The standardized
tests utilized in evaluating programs are designed to measure topics
commonly taught and to maximize individual differences. This masks
between-program differences and the unique aspects of different
programs. Analysis of item difficulties for an achievement test
relative to the program mean difficulty assists in identifying
program strengths and weaknesses. The correlations of programs (as
variables) using item difficulties as observations indicate the
degree of communality between programs. These techniques should
assist in evaluation of innovative educational programs. (Author)



UNIQUENESS OF EDUCATIONAL PROGRAMS: DIFFERENCES
VIA ACHIEVEMENT TEST ITEM DIFFICULTIES

Ernest A. Rakow, Boston College






Purpose

The purpose of this paper is to present a unique approach for more clearly
observing differences when evaluating educational programs. Generally,
standardized achievement tests are the central instrument utilized in
evaluation of educational programs such as Title I evaluations, Project
Follow-Through, and Equality of Educational Opportunity. Standardized
achievement tests are designed to maximize individual differences and measure
them reliably. Such tests measure topics taught in many educational programs
and avoid those topics which occur in few educational programs.
These tests are refined via statistical analysis of the items using the
item difficulty and discrimination. These statistical procedures are
applied to large samples of students from many educational programs to
maximize the reliability of the measurement of individual differences.
This also tends to increase measurement of common topics and avoidance of
unique topics. Consequently, while standardized tests provide excellent
measures of individual differences on common topics, they may not be
appropriate for program evaluation. Program evaluation should focus on
differences between programs, i.e. the ways in which particular programs are
unique, as well as indicating adequacy on the common topics.
Measurement of the unique aspects of an educational program would require
testing procedures which could identify homogeneous performance within a
particular program but heterogeneous performance between programs. This
paper presents an approach for further inspection of achievement test data
when evaluating programs.

Method

The techniques presented here are further analyses of item difficulties
for items of a norm-referenced achievement test. One could view each item
as another observation taken on each of the programs. The mean of the
item difficulties for each program could be examined to observe overall
differences in level of achievement. This would yield the same conclusions
as evaluating programs via the means of the test scores. Programs could
also be considered as variables, enabling the calculation of correlation
coefficients of these observations (items) on programs (variables). If
there are no program differences, other than level of achievement, these
correlations should all be approximately equal and their magnitude should
be close to that of the reliability of the test. However, if some
correlations are considerably lower than the test reliability, this is a clear
indication of program uniquenesses. If there are no unique program effects,
then the difficulty of an item relative to the mean difficulty for a program
should be approximately equal in all programs. If there are program
differences, then an item which measures a unique aspect of the program should
be of greater relative difficulty in one program than in another. It should
be noted that this approach concentrates on performance relative to the mean
of that program, not on the overall level of performance as reflected in
program means.
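
The two views described above (items as observations, programs as variables)
can be sketched in Python. This is only an illustration under stated
assumptions: the program names and difficulty values are hypothetical, and
the Pearson coefficient is assumed as the correlation, which the text does
not name explicitly.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def relative_difficulties(p):
    """Item difficulties expressed as deviations from the program mean."""
    mean = sum(p) / len(p)
    return [v - mean for v in p]


def program_correlations(difficulty):
    """Correlate every pair of programs, using items as observations.

    `difficulty` maps a program name to its list of item difficulties
    (percentage correct), with items listed in the same order.
    """
    names = sorted(difficulty)
    return {
        (a, b): pearson(difficulty[a], difficulty[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical percentages correct on four items for three programs.
difficulty = {
    "A": [80, 60, 40, 20],
    "B": [70, 50, 30, 10],  # same pattern as A, lower overall level
    "C": [40, 20, 80, 60],  # same overall level as A, different pattern
}
corr = program_correlations(difficulty)
```

In this toy data, programs A and B differ only in level, so their relative
difficulties coincide and their correlation is perfect; program C has the same
mean as A but a different pattern, which only the correlation reveals.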


If correlations significantly below the reliability coefficient are observed,
one could proceed by further evaluating the item difficulties. One could
calculate the intraclass correlation for each item between the programs.
The items with higher intraclass correlations indicate items measuring
greater program differences. One could also examine the item difficulties,
searching for items which may cause the lower correlation between programs.

Data 

The data for this paper are the item difficulties for a 69-item mathematics
achievement test administered to mathematics students in their final year
of secondary school in twelve countries. There are twelve item difficulty
estimates for each item, one for each country included in the sample.
These item difficulties were published in a bulletin by the International
Association for the Evaluation of Educational Achievement (IEA).

Results 

The results of applying these procedures to the data from the twelve
countries are interesting. First of all, there are significant differences
in the mean level of achievement (as was reported by IEA). But that is
not the point of this paper.

Treating the item difficulties as the observations on twelve countries,
a correlation matrix was calculated. These correlations are given in Table
1. At the bottom of this table the country means and reliabilities are
also given. The median reliability for this test in these countries is
0.88.

Excluding the main diagonal, the median correlation in this matrix is
0.70, which is significantly below the lowest reliability (.79). Only
six correlations are larger than the lower of the two reliability estimates
for the corresponding pair of countries. Three of these high correlations
are for the countries of England, Scotland and Australia (the only countries
of the British Commonwealth included here). These high correlations
indicate similar patterns of item difficulties relative to the country means.
One explanation for this could be similarities in the educational system
and especially in the emphasis of the mathematics curriculum. Note that
this is a very different interpretation than suggested by the overall means.
(England is high while Australia is low.) It is also granted that a
competing explanation for these high correlations could be the cultural and
social similarities of these countries. The other three high correlations
are for the countries of Holland, Sweden and Finland. Once again, the
two competing explanations are (1) cultural and social similarities or
(2) similarities in the educational system and in the emphases in the
teaching of mathematics.

Fifty of the sixty-six correlation coefficients in this matrix are
significantly below the lower of the two reliability estimates for that pair
of countries. (Significance is defined as having a Z score for the
correlation more than 1.65 standard errors below the Z score for the lower of
the reliability estimates. Hays, 1963.) The country with the lowest
correlations is Israel. The correlations of Israel with other countries range
from a low of 0.25 (the lowest in the matrix) to a high of 0.70. Other
countries in which every correlation is significantly below the reliability
estimates are the United States, Belgium and France. The lower correlations
are the result of differences in the pattern of item difficulties relative
to the mean. This would seem to indicate that the organization and emphasis
on topics within the mathematics curriculum has some unique aspects for
these four countries. This seems reasonable when one is aware that these
tests included items testing both higher and lower mental processes
and the topics of new mathematics, elementary and intermediate algebra,
Euclidean and analytic geometry, calculus, analysis and set theory. Perhaps
it should be noted that the test mean for Israel is very high while that
for the United States is low, so this finding reflects more than level of
performance alone.
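
The significance rule quoted above can be sketched as follows. This is a
hedged illustration, not the author's computation: it assumes the Z score is
Fisher's z transformation, that the standard error is 1/sqrt(n - 3) with n
taken as the number of items (69), and that the reliability is treated as a
fixed value. The function name and parameters are hypothetical.

```python
from math import atanh, sqrt


def significantly_below(r, reliability, n_items, critical=1.65):
    """True if Fisher's z for `r` lies more than `critical` standard errors
    below Fisher's z for `reliability`.

    Assumptions (not stated in the text): SE = 1 / sqrt(n_items - 3), and
    the reliability estimate is treated as a fixed reference value.
    """
    z_r = atanh(r)                     # Fisher z of the correlation
    z_rel = atanh(reliability)         # Fisher z of the reliability
    se = 1.0 / sqrt(n_items - 3)       # assumed standard error of z
    return (z_rel - z_r) > critical * se


# The median correlation (0.70) against the lowest reliability (.79),
# with the 69 items as observations.
median_case = significantly_below(0.70, 0.79, 69)
# The lowest correlation in the matrix (0.25) against the same reliability.
lowest_case = significantly_below(0.25, 0.79, 69)
```

Under these assumptions the median correlation of 0.70 falls just past the
1.65 standard error criterion relative to a reliability of .79, which agrees
with the statement in the text.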

These significantly lower correlations led to further examination of the
item difficulties. The next step was to calculate the intraclass correlation
for each item. This statistic provides an indication of the between-country
heterogeneity relative to the within-group homogeneity for each item.
The intraclass correlation can be interpreted as a proportion of explained
variance. If there are between-group differences, these intraclass
correlations would be greater than zero for each item. Also, if the relative
performance on these items was the same, then the intraclass correlation
should be approximately equal for all items. Table 2 shows this is not
the case. These intraclass correlations range from a low of .020 to a
high of .271.
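
A rough sketch of an intraclass correlation computed from item difficulties
alone is given below. The paper does not state which estimator was used, so
this is an assumption: groups are weighted equally and treated as large, the
between-group variance is the variance of the group proportions, and the
within-group variance of a right/wrong item is p(1 - p). The function name
and the example proportions are hypothetical.

```python
def item_intraclass(proportions):
    """Approximate intraclass correlation for one right/wrong item.

    `proportions` holds the proportion correct (0 to 1) in each group.
    Assumes equally weighted, large groups; this is a variance-ratio
    sketch, not necessarily the estimator used in the paper.
    """
    k = len(proportions)
    grand = sum(proportions) / k
    # Variance of the group proportions around the grand proportion.
    between = sum((p - grand) ** 2 for p in proportions) / k
    # Average binomial variance inside the groups.
    within = sum(p * (1 - p) for p in proportions) / k
    total = between + within
    return between / total if total > 0 else 0.0


# Hypothetical items: one on which groups differ widely, one on which
# they differ little.
spread_item = item_intraclass([0.2, 0.8])
flat_item = item_intraclass([0.4, 0.6])
```

Read as a proportion of explained variance, the first item attributes a much
larger share of its variance to group membership than the second, which is
the sense in which items with higher intraclass correlations are more
sensitive to between-group differences.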

Table 2 presents only a subset of this further analysis of the item
difficulties. Only thirty of the items are presented here. These items are
the ten with the highest intraclass correlations, ten with the lowest, and
the middle ten. This table also provides the percentage of correct
responses to these items for six of the countries and for all twelve countries
combined. The last line in the table is the percentage correct on the
total test of sixty-nine items. The second-to-last line is the percentage
correct on the subset of the ten items with the highest intraclass
correlations. Comparison of these percentages for 69 items and for 10 items
reveals the percentages for the United States and Australia are even lower
for ten items than for sixty-nine items, while for Israel the reverse is
true, i.e., the percentage is even higher for ten items than for
sixty-nine items. This is simply an indication that those ten items are more
sensitive to between-country differences than is the entire test. For the
items with highest intraclass correlation the typical range for the
percentage of correct responses for these items is about 60. For the middle
ten items on the intraclass correlation the typical range is 33. For the
lowest ten the typical range in percentages is 28.

This led to further analyses of the item difficulties. The percentages
right on individual items for each country were compared with the percentage
for all countries combined. In general, one would expect the three countries
with lower means to have items with lower percentages correct and would
expect higher percentages correct in the three countries with higher means.
This tends to be true. In Table 2 plus (+) signs are used in the three
low countries to indicate items whose percentage correct is above the level
for all twelve countries combined. For the ten items with the highest
intraclass correlation, in the United States only one item has a + and there
are only two +'s for Australia. Perhaps this caused the percent right for the
first ten items in each of these countries to be lower than for the entire
test. These are the items which were even more difficult than expected in
these countries. Negative (-) signs are used in the three high countries to
indicate items whose percentage correct is below the level for all twelve
countries. For these same ten items there is only one negative for Israel.
This contributed to a higher percentage right on these ten items than in
the total test.
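
The flagging convention described above can be sketched in a few lines. The
function name, the "low"/"high" labels, and the percentages below are
hypothetical and serve only to illustrate the rule.

```python
def flag_items(country_pct, combined_pct, group):
    """Flag items that run against a country's overall standing.

    `group` is "low" for a country with a lower mean (flag "+" when its
    percentage correct exceeds the combined percentage) or "high" for a
    country with a higher mean (flag "-" when it falls below).
    """
    flags = []
    for own, combined in zip(country_pct, combined_pct):
        if group == "low" and own > combined:
            flags.append("+")
        elif group == "high" and own < combined:
            flags.append("-")
        else:
            flags.append("")
    return flags


# Hypothetical percentages correct on four items.
combined = [60, 30, 20, 50]
low_country = flag_items([25, 45, 10, 50], combined, "low")
high_country = flag_items([90, 28, 60, 70], combined, "high")
```

Counting the flags per country over a chosen subset of items gives the kind
of tally discussed in the text, such as one + for a low country across the
ten most discriminating items.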

Further analysis of item difficulties within these sets of three countries
could be pursued. For example, the item with the intraclass correlation of
.216 (rank of 3) has a percentage right of 3 in the United States and 42
in Finland. For this same item England had 16 percent right while Israel
had 62 percent. These two pairs of countries have similar means, so this
item appears to indicate differences in mathematics ability which are not
shown in the country means. Other items also could reveal such differences,
such as the one with a rank of six or a rank of eight. On item six, the
percent right for the United States is 31 while it is 4 for Finland. On
item eight in England the percent correct is 70 while in Israel it is only
36. These two comparisons are the reverse of that shown in item three.
The effect of combining these items would be to show little difference in
performance for each of these pairs of countries. However, the item
difficulties clearly indicate there is a difference.



Importance 

These results indicate that analysis of item difficulties can be an important
technique in evaluating educational programs. These procedures aid in
identifying unique aspects of a program which may be different from another
program. Such uniquenesses may be hidden by examination of test scores
and differences between means. Thus, these techniques should be an
important aid in program evaluation.

TABLE 2. ITEM STATISTICS: INTRACLASS CORRELATIONS AND ITEM DIFFICULTIES

                        Percentage Correct
            Intraclass  Twelve
Rank        Corr.       Countries  U.S.    Aust.   Finl.   Japan   Engl.   Israel

1           .271        64         25      67+     70+     48-     91      90
2           .226        29         13      11      48+     52      28      90
3           .216        18         3       7       42+     29      16-     62
4           .215        48         39      29      75+     79      44-     75
5           .20?        41         11      47+     17      31-     81      71
6           ?           23         31+     10      4       8-      29      50
7           ?           20         14      10      4       59      12-     66
8           .189        54         58      25      72+     85      70      36-
9           .188        55         20      45      63+     49-     75      82
10          .178        23         19      16      9       21-     68      43

30          .085        87         84      90+     94+     92      92      90
31          .083        52         34      34      51      54      65      58
32          .081        49         43      29      45      67      48      91
33          .080        28         16      18      24      43      40      43
34          .079        72         48      62      73      82      83      88
35          .074        71         54      62      80+     90      80      64
36          .070        41         47+     32      47+     53      49      36-
37          .069        43         23      38      53+     52      57      87
38          .069        48         29      46      44      53      63      83
39          .068        47         32      34      47      54      55      53

60          .038        66         51      59      62      71      79      77
61          .038        63         57      53      56      64      76      86
62          .037        67         49      71+     55      77      82      89
63          .035        17         25+     13      11      25      15-     1-
64          .035        29         33+     23      11      30      38      25-
65          .031        31         37+     25      14      28-     46      27-
66          .021        60         53      53      74+     61      70      55-
67          .021        21         21      15      24+     33      19      20
68          .021        62         63      64      58      55-     76      60
69          .020        51         57+     44      37      49      58      70

First ten               37.5       23.4    26.2    40.6+   46.0    51.2    66.5
Total test  .199        46.9       37.0    39.8    ?       54.1    57.7    61.2

+ These percentages are higher than for all twelve countries combined even
  though the means suggest lower percentages.

- These percentages are lower than for all twelve countries combined even
  though the means suggest higher percentages.

? Value or digit illegible or missing in the source copy.